Abstract

Scientific research often involves testing more than one hypothesis at a 
time, which can inflate the probability that a Type I error (false discovery) 
will occur. To prevent this Type I error inflation, adjustments can be made 
to the testing procedure that compensate for the number of tests. Yet many 
researchers believe that such adjustments are inherently unnecessary if the 
tests were “planned” (i.e., if the hypotheses were specified before the study 
began). This longstanding misconception continues to be perpetuated in 
textbooks and continues to be cited in journal articles to justify disregard 
for Type I error inflation. I critically evaluate this myth and examine its 
rationales and variations. To emphasize the myth’s prevalence and 
relevance in current research practice, I provide examples from popular 
textbooks and from recent literature. I also make recommendations for 
improving research practice and pedagogy regarding this problem and 
regarding multiple testing in general. 
1. Background



1.1. Null Hypothesis Testing 

The null hypothesis is the hypothesis that a particular independent/grouping variable has no 
effect on (or no association with) a particular outcome variable. Often, the null hypothesis is 
the hypothesis that the researcher’s prediction is wrong. For instance, if a researcher predicts 
that a particular treatment reduces depression in humans (on average), then the null hypothesis 
is that the treatment does not work. If a researcher predicts that a certain genetic allele is 
associated with Alzheimer’s disease, then the null hypothesis is that the allele has no 
association with Alzheimer’s disease. However, the null hypothesis applies even when the 
researcher makes no official prediction, so long as there is a possibility that there is no 
effect/association. 

Because hypotheses typically cannot be tested on the entire population of interest (e.g., by 
analyzing the genomes of every living human being), hypotheses are instead tested on a finite 
sample of the population. Thus, a researcher never knows with 100% certainty whether an 
ostensible effect observed in the sample actually applies to the population or whether it is due 
to “chance.” For instance, despite random assignment, a treatment group may happen to be, on 
average, more predisposed to improve than the subjects in a placebo group. 

In conventional (frequentist) hypothesis testing, the researcher addresses this inevitable
uncertainty by computing a p-value based on the observed data. Roughly speaking, the p-value
represents the theoretical probability that the observed effect (or a larger effect) would occur
by chance if the null hypothesis were true. Once computed, the p-value is then compared to a
predesignated critical value called the alpha level (α), such that if p < α, then the null
hypothesis may be rejected. Once the null hypothesis is rejected, the observed effect may be 
declared statistically significant, and a corresponding decision can be made (e.g., a treatment 
is recommended, an association is claimed, a follow-up study is pursued, etc.). 

A statistically significant result that occurs when the null hypothesis is true is called a Type I 
error. Hence, α represents the maximum Type I error rate that the researcher is willing to
tolerate. For example, among tests that use the conventional .05 alpha level, a Type I error is 
allowed to occur up to 5% of the time. 

Type I error rates can be reduced by making alpha levels lower (i.e., more stringent), but only 
at the expense of statistical power (the likelihood of producing statistically significant results 
when the null hypothesis is false). Because frequently the goal of research is to 
discover/demonstrate some effect or association, and because researchers typically face 
considerable pressure to find statistical significance (e.g., in order to get published), 
researchers are often reluctant to sacrifice statistical power. 


Another way to reduce the effective Type I error rate is to require that significant results be 
promptly replicated by a second study with a completely new sample. In terms of the effective 
Type I error rate, making statistical significance conditional on two independent tests, each at
α, is equivalent to conducting a single test at α^2 (e.g., at .0025 when nominal α = .05).
However, immediate full-scale replications are rare, largely for practical reasons. More 
commonly, significant results are reported shortly after they are obtained, rather than withheld 
pending an independent corroboration. 







1.2. The Problem of Multiple Testing 


The Type I error rate is fairly straightforward when there is only one test. However, scientific 
research often involves testing more than one hypothesis at a time, for example, when 
evaluating more than one mean difference or more than one correlation. The resulting problem 
of multiplicity (multiple testing) is known to many researchers: Every hypothesis test added to 
a data analysis carries additional potential for error, so the testwise alpha levels (i.e., the 
nominal alpha levels at which tests are conducted) can substantially understate the effective 
Type I error rate for the investigation as a whole. For example, when two tests are conducted, 
each at the .05 level, the probability that at least one of them would produce a Type I error if
both null hypotheses were true may be as high as .10, though the exact probability depends on the
statistical dependence (e.g., the correlation) between the tests. 

Thus, if Type I errors are to be controlled (i.e., contained at a given rate), then adjustments 
should be made to compensate for the number of tests in the family (the set of tests being
examined). These adjustments, sometimes called “corrections,” typically involve reducing
testwise alpha levels (or equivalently, adjusting p-values upwards), thereby reducing statistical
power. However, multiplicity adjustments also apply to the widths of confidence intervals,
even when p-values are not used (Benjamini & Yekutieli, 2005; Dunn, 1961; Hsu, 1996;
Miller, 1981). Confidence intervals are computationally related to null hypothesis tests, but are 
used to make inferences about the estimated effect sizes, rather than merely about whether the 
effects are zero or nonzero. Note that although this article generally discusses multiplicity in 
terms of null hypothesis testing, the same principles of multiplicity are relevant to computing 
confidence intervals for effect size estimates. 

1.3. Ways to Define the Type I Error Rate in Multiple Testing 

Many multiple testing procedures (i.e., methods of adjustment for multiplicity) have been 
devised. Which multiple testing procedure is preferable for which situation is a complex 
question that cannot be definitively answered, but using no method at all is clearly a poor 
default strategy. In any case, before choosing a multiple testing procedure, one should first 
decide which error rate is relevant for the given investigation (Benjamini, 2010). Many error 
rates have been defined, most notably the following three, presented here in order of 
decreasing stringency. Note that each of these three error rates is equal to the testwise alpha 
level when there is only one test, but can inflate as the number of tests increases. 

1.3.1. Per-Family Type I Error Rate (PFER) 

The PFER (Tukey, 1953) is the expected number of Type I errors per family. Note that the 
“expected number” is a long-term average, not an upper bound on the number of Type I errors 
likely to occur in any single investigation. The PFER is typically controlled using the 
Bonferroni procedure, which can be applied to any set of p-values by setting the testwise alpha
level at α / m, where α is the designated overall alpha level and m is the number of tests. The
Bonferroni procedure can be similarly applied to confidence intervals, by expanding the width
of each interval at the nominal 1 - α confidence level to what it would be at the 1 - α / m
confidence level (Dunn, 1961).
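
As a minimal sketch of the Bonferroni procedure for p-values, the following Python function compares each p-value with α / m. The function name and the example p-values are hypothetical and serve only as an illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return booleans (in input order): True where the null is rejected.

    Each p-value is compared with the testwise level alpha / m,
    where m is the number of tests in the family.
    """
    m = len(p_values)
    testwise_alpha = alpha / m
    return [p <= testwise_alpha for p in p_values]


# Hypothetical p-values from a family of three tests.
example_p = [0.010, 0.020, 0.400]
print(bonferroni_reject(example_p))  # [True, False, False]
```

With three tests and an overall alpha level of .05, the testwise alpha level is about .0167, so only the first hypothetical test remains significant.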


1.3.2. Familywise Type I Error Rate (FWER) 


The FWER (Tukey, 1953) is the probability that at least one Type I error will occur in a given 
family. Thus, FWER control is more permissive of Type I errors than PFER control is, because 
multiple simultaneous errors do not add to the tally of “at least one Type I error” any more 
than a single error does. However, in many cases, the FWER is only negligibly lower than the 
PFER, especially when the number of tests is small and the dependency among the tests is low 
(because simultaneous Type I errors are relatively rare under such conditions). 
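
To illustrate how close the two error rates can be, the following sketch computes the PFER and the FWER for unadjusted tests under two simplifying assumptions: all m null hypotheses are true, and the tests are independent. The function names are hypothetical. The PFER is the expected count m × α, and the FWER is the complement of the probability that no test produces a Type I error.

```python
def pfer_unadjusted(alpha, m):
    """Expected number of Type I errors for m unadjusted tests, all nulls true."""
    return m * alpha


def fwer_independent(alpha, m):
    """P(at least one Type I error) for m independent unadjusted tests."""
    return 1 - (1 - alpha) ** m


for m in (1, 2, 3, 10):
    pfer = round(pfer_unadjusted(0.05, m), 4)
    fwer = round(fwer_independent(0.05, m), 4)
    print(m, pfer, fwer)
```

For two tests at the .05 level, this gives a PFER of .10 and an FWER of .0975; the gap between the two rates widens as the number of tests grows.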

The Bonferroni procedure is often described as controlling the FWER, which it does, because 
any procedure that controls the PFER at α controls the FWER at ≤ α. However, by sacrificing
strict PFER control, other methods of FWER control (e.g., Holm, 1979; Hommel, 1988) can
provide more statistical power; see Dmitrienko, Tamhane, and Bretz (2010) for a litany of such
methods, each with its own advantages and limitations. Thus, given the multitude of
FWER-controlling procedures available, the oft-lamented “conservatism” of the Bonferroni procedure
is not an adequate excuse for forgoing FWER control altogether. 
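
As one example of a more powerful FWER-controlling method, the following is a minimal sketch of the Holm (1979) step-down procedure. The function name and the example p-values are hypothetical. The ordered p-values are tested against progressively less stringent levels, α / m, α / (m - 1), and so on, and testing stops at the first nonsignificant result.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm (1979) step-down procedure; returns booleans in input order."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is tested at alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values from a family of three tests.
example_p = [0.010, 0.020, 0.400]
print(holm_reject(example_p))  # [True, True, False]
```

For these hypothetical p-values, the second test (p = .020) is significant under the Holm procedure because it is tested at .05 / 2 = .025, whereas the Bonferroni procedure would test it at .05 / 3.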

It is important to distinguish FWER control from “weak FWER control,” which is FWER 
control that is reliable when all null hypotheses are true, but can fail when one or more null 
hypotheses are false. Weak FWER control is typically achieved by making several 
simultaneous tests (none of which are adjusted) conditional on the statistical significance of a 
single omnibus test (e.g., ANOVA or MANOVA), a technique that is sometimes called 
“protected” testing. Because this approach does not reliably control Type I error (except in 
certain circumstances), it has very limited applicability (Benjamini, 2010; Goeman & Solari, 
2014; Hsu, 1996; Tamhane, 2009). In fact, most methods of Type I error control do not require 
omnibus tests at all (Dmitrienko, Tamhane, & Bretz, 2010). 

1.3.3. False Discovery Rate (FDR) 

The term false discovery is generally synonymous with Type I error, but the term FDR refers 
to one particular form of Type I error rate (Benjamini & Hochberg, 1995). Roughly speaking, 
the FDR is the expected proportion of statistically significant tests that are Type I errors in a 
given family (except when all null hypotheses are true, in which case the FDR is equivalent to 
the FWER). Note that the expected proportion is a long-term average, not an upper bound on 
the proportion of statistically significant tests likely to be false in any single investigation. 
Note also that the computation of this long-term average defines the proportion as zero when 
no tests are significant. 

Any procedure that controls the FWER at α controls the FDR at ≤ α, but by sacrificing strong
FWER control, dedicated FDR-controlling procedures can provide more statistical power. 
FDR control can be useful when there are numerous tests and allowing some Type I errors is 
not very harmful (e.g., when screening for associations to be examined in subsequent studies). 
However, FDR control is not sufficient when stronger, more confirmatory inference is required 
(Benjamini, 2010; Dmitrienko, Tamhane, & Bretz, 2010). Note also that the relevance of the 
FDR is limited when hypotheses have unequal likelihoods, because tests that are known to 
produce low p-values (call them “ringers”) can drive down the FDR, thereby allowing tests
with higher p-values to become statistically significant (Finner & Roters, 2001).
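
The effect of “ringers” can be illustrated with a minimal sketch of the Benjamini and Hochberg (1995) step-up procedure, which is one dedicated FDR-controlling procedure. The function name and all p-values below are hypothetical. The procedure finds the largest rank k for which the k-th smallest p-value is at most k × α / m and rejects the k hypotheses with the smallest p-values.

```python
def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg (1995) step-up procedure; booleans in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value is at most k * alpha / m.
    largest_k = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k * alpha / m:
            largest_k = k
    # Reject the hypotheses with the largest_k smallest p-values.
    reject = [False] * m
    for i in order[:largest_k]:
        reject[i] = True
    return reject


# Hypothetical family of two tests: neither is significant.
print(benjamini_hochberg_reject([0.035, 0.300]))
# Adding three hypothetical "ringers" makes the p = .035 test significant.
print(benjamini_hochberg_reject([0.035, 0.300, 0.0001, 0.0001, 0.0001]))
```

In the first call, p = .035 is compared with 1 × .05 / 2 = .025 and is not significant. In the second call, the three very low p-values push the same test to rank 4 of 5, where it is compared with 4 × .05 / 5 = .04 and becomes significant.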

1.4. Scientific Harm Caused By Type I Errors 

Subjecting hypotheses to rigorous testing is a cornerstone of the scientific method. If false
discoveries were inconsequential, then researchers’ speculations and intuitions could simply be
declared correct without being tested at all. However, false discoveries can cause “scientific 
harm,” for example, by impeding scientific progress, misdirecting scientific understanding, 
impairing scientific credibility through poor replicability (reproducibility of results), and 
causing resources to be squandered on spurious findings. Hence, although Type I errors cannot 
be eliminated, they should be controlled. 

Of course, “missed true discoveries” (Type II errors) can be scientifically harmful in their own 
way, which is why it is important to use sample sizes that provide adequate statistical power. 
However, Type II errors are arguably more likely to be corrected than Type I errors, because 
they tend to be less reinforced by factors such as confirmation bias and publication bias, and 
because promising leads are unlikely to be abandoned without a second look simply because 
statistical significance was missed by some nominal amount; note that a failure to reject the 
null hypothesis does not necessarily constitute an acceptance of the null hypothesis. Moreover, 
as Ryan (1962) opined regarding the comparative threats of Type I and Type II errors in 
psychology research, “I believe that it is less important if we miss some very small effect of a 
variable, than it is to claim that the variable has an effect (of unspecified magnitude) which 
does not actually exist at all” (p. 305). Note also that uncontrolled Type I error rates threaten 
the credibility even of true discoveries, as statistical significance ceases to be meaningful when 
it is too easily achieved by chance. 

By limiting the rate at which false discoveries are allowed to occur, hypothesis testing provides 
some protection against the scientific harm caused by false discoveries. The purpose of 
multiplicity adjustment is simply to preserve that limit when there are multiple simultaneous 
opportunities for scientific harm. Hence, multiplicity adjustments should account for each 
opportunity for scientific harm, that is, each test that would constitute a discovery on its own if 
statistically significant. The number of potential discoveries in a given study is often 
straightforward, but sometimes subjective. As the following two examples illustrate, whether 
certain tests qualify as potential discoveries depends on how the results might be used: 

First, consider a 2 (teaching method: old, new) x 2 (student gender: male, female) factorial 
design with three planned orthogonal contrasts: main effect for teaching method, main effect 
for gender, and an interaction, with some measure of student achievement as the dependent 
variable. Imagine that the researchers will publish their findings if any of the three contrasts 
are statistically significant. In this case, the probability of publishing a false discovery can be 
nearly three times the testwise alpha level, so adjustment for multiplicity is advisable. 

On the other hand, imagine that for the same 2x2 design and the same three contrasts, the 
goal of the study is to get approval to replace the old teaching method with the new one, that 
is, the goal is to demonstrate a main effect for teaching method. Imagine that the other 
contrasts are merely descriptive (e.g., to verify an assumption that student gender is irrelevant 
to achievement in the course). Multiplicity adjustment is arguably not necessary in this case, 
because the opportunity for a harmful false discovery is confined to a single contrast: main 
effect for teaching method. A main effect for gender could make an interesting refinement of 
the results, and a method-gender interaction could be a relevant caveat to the results, but only 
a main effect for teaching method has the potential to generate approval for the new method (in 
fact, a method-gender interaction might even prevent approval). 

Clearly, the potential for harm caused by Type I and Type II errors must be evaluated on a
case-by-case basis. There are other subjectivities to consider as well. For example, researchers
may disagree on whether a particular study containing three experiments should be considered 
to have three distinct families of hypotheses, or whether all the tests in the study should be 
considered as a single family and adjusted accordingly. And even in the absence of 
multiplicity, researchers may disagree on what overall alpha level is appropriate, as there is no 
particular scientific specialness to the .05 level and some questions presumably require more 
confident answers than others. 

However, the fact that there is subjectivity regarding an issue does not mean that all statements 
about that issue are equally valid. For example, it would not be sensible to say, “Because there 
is subjectivity regarding what alpha level is appropriate, it is therefore appropriate to test all 
my hypotheses at α = .99.” Nor is it sensible to say, “Because there is subjectivity regarding
how multiplicity should be handled, it is therefore appropriate to disregard multiplicity.” On 
the contrary, subjective issues frequently require more thoughtful consideration than objective 
issues. 

2. Planned-Hypotheses Exemption From Multiplicity Adjustment (PHEMA) 

As numerous authors have noted (e.g., Anderson, 2014; Glickman, Rao, & Shultz, 2014; Ha & 
Ha, 2012; Iacobucci, 2001; O’Keefe, 2003; Rutherford, 2011; Ryan, 1959, 1995; Sheskin, 
2011; Stangor, 2015; Stanley, 1957; Steinfatt, 2006; Streiner, 2015; Thompson, 1994; Tucker, 
1991; Weiss, 2006), many in the applied sciences consider it appropriate not to adjust for 
multiplicity if the tests were planned (i.e., if the hypotheses were specified a priori, meaning
before the study began). In fact, researchers have frequently defended their unadjusted tests
explicitly on the basis that the tests were planned (see Table 1 for a few examples). The belief
that stating one’s hypotheses a priori eliminates or excuses Type I error inflation—a belief this 
article refers to as the planned-hypotheses exemption from multiplicity adjustment (PHEMA)— 
has no apparent mathematical or scientific basis. Yet the myth continues to be perpetuated. For 
example, consider the following passage from a popular textbook: 

With planned comparisons, we do not correct for the higher probability of Type I 
error that arises due to multiple comparisons, as is done with the post hoc methods 
. . . Because planned comparisons do not involve correcting for the higher 
probability of Type I error, planned comparisons have higher power than post hoc 
comparisons. (Pagano, 2013, p. 422; emphasis in original)

See Tucker (1991) and Wang (1993) for similar statements. Note that although PHEMA does 
not come with an empirical justification, it does come with a seductive offer: more statistical 
power. 


Table 1. Defense of Unadjusted Multiple Testing

| Study | Journal | Excerpt |
| --- | --- | --- |
| Cachelin et al., 2014, p. 453 | Cultural Diversity and Ethnic Minority Psychology | “The t-tests were planned and hypothesis driven, therefore no adjustment for multiple testing was employed.” |
| Fenesi et al., 2014, p. 257 | The Journal of Experimental Education | “All post hoc t tests were Bonferroni corrected to p [sic] < .05; a priori planned comparisons were not (Perenger [sic], 1998; Rothman, 1990).” |
| Glaus et al., 2014, p. 39 | Journal of Psychiatric Research | “P-values were not adjusted for multiple testing because the hypothesized associations between mental disorders and inflammatory markers were specified a priori.” |
| Holmes et al., 2014, p. 3 | Mutation Research: Fundamental and Molecular Mechanisms of Mutagenesis | “Since all comparisons among means were considered to be of substantive interest a priori, no adjustment for multiple comparisons was incorporated into the analysis.” |
| Krane-Gartiser et al., 2014, p. 8 | PLoS ONE | “A correction for multiple comparisons adjusting for the total number of statistical tests has not been done since the analyses were planned before they were conducted.” |
| MacDonald & Barry, 2014, p. 103 | International Journal of Psychophysiology | “Since all contrasts were planned and there were no more of them than the degrees of freedom for effect, no Bonferroni-type adjustment to α was necessary.” |
| Pataki, Metz, & Pakulski, 2014, p. 253 | Journal of Early Childhood Literacy | “No correction for multiplicity was employed as our a priori intent was to test each variable independently.” |
| Pyra et al., 2014, p. 1133 | Journal of General Internal Medicine | “All analyses were planned a priori; therefore, p values were not adjusted for multiple comparisons.” |
| Stenfors et al., 2014, p. 5 | BMC Psychology | “Since the significance tests were used to evaluate a set of a priori hypotheses, individual test results were not corrected for multiple significance testing.” |


3. Possible Origins of PHEMA 

The term planned comparisons is often used in the context of ANOVA-based analyses, but 
more generally can refer to any tests of hypotheses (sometimes called specific hypotheses) that
were generated a priori from the original research questions. Planned comparisons are 
distinguished from unplanned comparisons, which are performed without any a priori 
expectation, for example, when relationships that were not previously considered interesting 
are detected in the data. Note that the number of unplanned comparisons implicitly includes 
not only those that are reported, but also any comparison that would have been reported had it 
been statistically significant (Tamhane, 2009). Consequently, if a researcher is willing to tout 
the relevance of any relationship that happens to turn up, then the opportunity for Type I error 
is inflated by every spurious relationship that could potentially appear. Thus, it is true that 
controlling Type I error for all conceivable tests (e.g., all possible comparisons) typically
requires more severe adjustment (and hence “costs” more in statistical power) than controlling 
Type I error for only a predetermined subset of tests (Cohen, Cohen, West, & Aiken, 2003; 
Hsu, 1996). But unfortunately, that truth seems to have been distorted into the myth that 
planned comparisons do not require adjustment at all. 

Ryan (1995) blamed this confusion partly on ambiguous use of the term post hoc, which 
means “formulated after the fact.” For example, the phrase post hoc tests is often used to mean 
unplanned tests (i.e., tests conceived post data-collection), but is sometimes used to mean 
multiple tests in general (especially multiple tests conducted following an omnibus test). This
equivocation may lead some to believe that multiple testing is only of concern for unplanned 
tests—a confusion that is perhaps reinforced by statistical software, such as SPSS, that list all 
multiplicity adjustments, including the Bonferroni procedure, as “post hoc” options (Howell, 
2013). 


4. Rationalizations for PHEMA 

4.1. Greater Importance of Planned Tests 

Keppel and Zedeck (1989, p. 172) noted that PHEMA “is generally defended by the argument 
that planned comparisons typically constitute the primary purpose of a study, and as such, they 
should be subjected to the most sensitive statistical test possible.” However, this approach
allows the most important questions (i.e., “the primary purpose” of the study) to be
investigated with the least rigor (i.e., with minimal control of Type I error). Moreover, using 
“the most sensitive statistical test possible” only makes sense under the constraint that Type I 
error is controlled. Otherwise, why not set the alpha level at .99 rather than at .05? After all, if 
Type I error control is not of concern, then any test can be made more “sensitive” (i.e., more 
statistically powerful) simply by raising the alpha level. A better way to achieve adequate 
statistical power would be to invest in a larger sample size. 

Incidentally, if a study involves one planned test of primary importance and multiple tests of 
somewhat lesser interest, there is a simple way to control the FWER without reducing the 
sensitivity of the primary test: 

Step 1: Conduct the primary test at the unadjusted alpha level. 

Step 2: If the primary test is significant, then conduct the secondary tests using testwise alpha 
levels adjusted for the number of secondary tests. But if the primary test is not significant, then 
forfeit the significance of the secondary tests. Note that when using this method, the testing 
order and conditionality should be explicitly outlined a priori in a registered study protocol. 
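
The two steps above can be sketched as follows. This is only an illustration: the function name and example p-values are hypothetical, and the Bonferroni procedure is assumed as the adjustment for the secondary tests, although Step 2 does not prescribe a particular adjustment.

```python
def primary_then_secondary(p_primary, p_secondary, alpha=0.05):
    """Return (primary_rejected, secondary_rejections).

    Step 1: test the primary hypothesis at the unadjusted alpha level.
    Step 2: only if the primary test is significant, test the secondary
    hypotheses at alpha / (number of secondary tests). Bonferroni is an
    assumption of this sketch, not a requirement of the method.
    """
    primary_rejected = p_primary <= alpha
    if not p_secondary:
        return primary_rejected, []
    if not primary_rejected:
        # The significance of all secondary tests is forfeited.
        return False, [False] * len(p_secondary)
    testwise_alpha = alpha / len(p_secondary)
    return True, [p <= testwise_alpha for p in p_secondary]


# Hypothetical p-values: one primary test and three secondary tests.
print(primary_then_secondary(0.030, [0.010, 0.030, 0.200]))
print(primary_then_secondary(0.080, [0.010, 0.030, 0.200]))
```

In the first call, the primary test is significant at .05, so the secondary tests are evaluated at .05 / 3 and only the first is significant. In the second call, the primary test is not significant, so no secondary test can be declared significant, even though one of them has p = .010.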

4.2. Greater Credibility of Planned Tests 

Another common rationale for PHEMA is that a priori predictions are presumably logical 
extensions of extant knowledge and are therefore more likely to be correct (Abelson, 1995; 
Anderson, 2014; Ha & Ha, 2012; McHugh & Ellis, 1957; Rutherford, 2011). One textbook 
advised the following: “Because you have preplanned these comparisons, typically based on 
prior data and theory, and you do not plan to do all possible comparisons, you are not required 
to make a correction for your alpha (α) level” (Ha & Ha, 2012, p. 206, emphasis in original).
However, that appears to be a non sequitur. It may be true that a group of predictions are 
generally more likely to be correct if they have some theoretical basis, but the same would be 
true of a single prediction. Thus, why should “preplanning” excuse relaxed Type I error 
control for multiple tests if preplanning would not excuse relaxed Type I error control for one 
test? 

5. Dissemination of PHEMA: An Example 

Even a patently false heuristic such as PHEMA can become popular if it tells people what they 
want to hear, for example, that multiple tests may be conducted without sacrificing statistical 
power. For instance, Perneger’s (1998) manifesto against multiplicity adjustments, which
promoted PHEMA and numerous other misunderstandings (as noted by Aickin, 1999; Bender 
& Lange, 1998; Goeman & Solari, 2014), has been cited by over 3,000 articles as of this 
writing—and the majority of those articles were published in 2010 or later (as per Google 
Scholar). One such article defended its unadjusted tests as follows: 

Because we were testing specific hypotheses, we performed planned comparisons, 
which, unlike post hoc tests, do not need to be adjusted. In light of criticism in the 
literature levelled at Bonferroni and other corrections (e.g., Perneger, 1998), the
analyses were performed without adjustment. (Roche & Chainay, 2013, p. 1017) 

Sijbrandij, Engelhard, Lommen, Leer, & Baas (2013) offered a similar justification for their
unadjusted tests, also citing Perneger: “Since pre-specified hypotheses were tested, no formal
corrections for multiple comparison [sic] were carried out (Perneger, 1998)” (p. 1993). For
other PHEMA-based statements citing Perneger, see Askari, Kirby, Parker, Thompson, &
O’Neill (2013), Clifford et al. (2012), Fenesi, Heisz, Savage, Shore, & Kim (2014), Kawai et
al. (2014), Krane-Gartiser, Henriksen, Morken, Vaaler, & Fasmer (2014), Lau, Lin, & Flores
(2012), Weisse et al. (2013), and many others. 

6. Variations on PHEMA 

6.1. Constraining PHEMA to Orthogonal Contrasts 

Many textbooks have suggested that although multiplicity may be of concern for some planned 
tests, multiplicity is not of concern for planned orthogonal contrasts (Abdi & Williams, 2010; 
Brown, 1990; Cohen, 2013; Cohen et al., 2003; Doncaster & Davey, 2007; Kirk, 2013; 
Pedhazur & Schmelkin, 1991; Randolph & Meyers, 2013; Zieffler, Harring, & Long, 2011). In
fact, some researchers have explicitly defended their unadjusted comparisons on that basis 
(e.g., Harkness & Luther, 2001; Nam & Zellner, 2011; Nieuwenhuis, Folia, Forkstam, Jensen, 
& Petersson, 2013). 

The reasoning for this version of PHEMA may be summarized as follows (Abdi & Williams, 
2010, p. 248): “Planned orthogonal contrasts are equivalent to independent questions asked to 
the data. Because of that independence, the current procedure is to act as if each contrast were 
the only contrast tested” (see also Thompson, 1994). However, this rationale appears to depend 
on equivocal use of the word “independence”: Statistical independence (i.e., mutual 
orthogonality) among the tests does not imply that each result should be interpreted 
“independently” (i.e., without regard to how many other tests were conducted). 

In fact, the FWER is higher for orthogonal tests than for positively dependent tests. 
Specifically, the maximum FWER for unadjusted tests monotonically diminishes from
1 - (1 - α)^m to α as the correlation among the tests increases from 0 to 1, where α is the
designated alpha level and m is the number of tests. Thus, not only is adjustment for
multiplicity potentially important for orthogonal contrasts (Bechofer & Dunnett, 1982), one 
could argue that it is especially important for orthogonal contrasts. Incidentally, the maximum 
FWER can be higher for negatively dependent tests than for orthogonal tests, but typically 
only marginally so, and negative dependence is generally not plausible for two-sided tests. 
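
The relationship between dependence and the FWER can be checked with a small Monte Carlo sketch. The setup is an assumption made for illustration only: all null hypotheses are true, the tests are unadjusted two-sided z tests, and the test statistics are equicorrelated standard normal variables. The function name, the number of simulations, and the random seed are likewise hypothetical choices.

```python
import random
from statistics import NormalDist


def simulated_fwer(rho, m=3, alpha=0.05, n_sim=40000, seed=1):
    """Estimate the FWER of m unadjusted two-sided z tests, all nulls true.

    The test statistics are standard normal with common correlation rho
    (0 <= rho <= 1), built from one shared component plus independent noise.
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    shared_weight = rho ** 0.5
    unique_weight = (1 - rho) ** 0.5
    families_with_error = 0
    for _ in range(n_sim):
        shared = rng.gauss(0, 1)
        z_values = [shared_weight * shared + unique_weight * rng.gauss(0, 1)
                    for _test in range(m)]
        # A family counts once if at least one test is significant.
        if any(abs(z) > critical for z in z_values):
            families_with_error += 1
    return families_with_error / n_sim


for rho in (0.0, 0.5, 1.0):
    print(rho, simulated_fwer(rho))
```

Under these assumptions, the estimate for three uncorrelated tests should fall near 1 - (1 - .05)^3, or about .14, and the estimate for perfectly correlated tests should fall near .05, with intermediate correlations in between.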

6.2. Constraining PHEMA to Small Numbers of Hypotheses 

Another variation on PHEMA asserts that multiplicity may be disregarded for planned tests 
provided that the number of tests is sufficiently small. Limiting the number of unadjusted tests 
that may be excused by PHEMA is often recognized as necessary “because otherwise, the 
researcher could delineate a very long list of contrasts and claim them all as planned” 
(Iacobucci, 2001, p. 7). 

For multigroup designs, some authors have set the maximum number of unadjusted 
comparisons at one less than the number of groups (e.g., Keppel & Zedeck, 1989; Tabachnick 
& Fidell, 2012). This limit is equal to the maximum number of orthogonal contrasts and also 
equal to the number of numerator degrees of freedom that would be available in an omnibus 
test. Other proposed limits on the number of unadjusted tests have been less precise, e.g., a
“small number” (Armstrong, 2014, p. 505; Hays, 1988, p. 411; Helweg-Larsen & Nielsen,
2009, p. 91; McKillup, 2012, p. 163; Streiner & Norman, 2011, p. 18), or a “low” number 
(Baguley, 2012, p. 491), or “few” (Pagano, 2013, p. 402; Welkowitz, Cohen, & Lea, 2012, p. 
364). However, all of these proposed constraints are overly permissive of Type I error 
inflation, given that even going from one test to two tests without adjustment can roughly 
double the PFER and FWER. 

Moreover, allowing more Type I error inflation for a small number of tests than for a large 
number of tests is arbitrary and logically inconsistent. For instance, suppose that if there are 
only three tests, then it is deemed acceptable not to adjust for multiplicity, but that if there are 
ten tests, then FWER control is deemed necessary. Assuming an unadjusted alpha level of .05, 
the maximum FWER for three tests is roughly .14. But if .14 is an acceptable FWER for three 
tests, then why should .14 not be an acceptable FWER for ten tests? That is, why insist that the 
Type I error rate for one test should be controlled at .05, and that the FWER for ten tests 
should also be controlled at .05, but that the FWER for three tests may be controlled at .14? 

6.3. Reverse-PHEMA 

Some authors have proposed the opposite of PHEMA: that planned tests require multiplicity 
adjustment and that unplanned tests are exempt (e.g., Rovai, Baker, & Ponton, 2014, p. 256). 
This heuristic, which is no more mathematically justifiable than PHEMA, is perhaps based on 
an assumption that unplanned tests are typically exploratory (i.e., not confirmatory) and 
therefore require less rigorous control of Type I error. However, even exploratory analyses 
often require some form of multiplicity adjustment, as one would not want to waste resources 
following up on an excessive number of spurious preliminary findings (Tamhane, 2009). It is 
true that in some unplanned testing scenarios, the number of implicit tests may be 
indeterminate, making formal multiplicity adjustment impossible (Bender & Lange, 2001). 
However, in such contexts, p-values can only serve a descriptive function and should not be 
interpreted—or reported—as if they are hypothesis test results. 

7. Conclusions 

There is considerable concern in the sciences about poor replicability of published findings 
and what is perceived as a high prevalence of false discoveries (Pashler & Wagenmakers, 
2012). Adequate control of Type I error inflation directly relates to those issues and is 
essential to good research practice and scientific soundness (Benjamini, 2010; Bretz & 
Westfall, 2014; Hsu, 1996). False heuristics such as PHEMA, which discourage thoughtful 
handling of multiplicity, are therefore a nontrivial hindrance to research quality. 

That is not to say that PHEMA necessarily reflects the dominant view among researchers. For 
example, in confirmatory trials to demonstrate drug efficacy, comparisons are typically 
required to be both prespecified in the study protocol and adjusted for any multiplicity 
(European Agency for the Evaluation of Medicinal Products, 2002; U.S. Department of Health 
and Human Services, 1998). But given that so many respected textbooks have endorsed 
PHEMA in one form or another, and given that so many recent articles have used PHEMA to 
justify forgoing multiplicity adjustment, it is evident that awareness, education, and standards 
of practice regarding this issue need improvement. Therefore, although the present article is 
not the first to criticize PHEMA (e.g., see Ryan, 1959, 1995), it aims to provide the most 
thorough refutation of PHEMA and its variations. 


7.1. Recommendations for Researchers 

(a) Avoid using PHEMA as an excuse for unadjusted (or under-adjusted) tests. In some cases, 
there may be a legitimate reason not to adjust—but PHEMA is not such a reason. Note that the 
mere fact that subjectivities and disagreements about multiple testing exist does not mean that 
the problem may be disregarded or that all statements about the problem are equally valid. 

(b) Select an error rate appropriate for the type of inference required. For example, PFER 
control is appropriate when the veracity of each claimed discovery is highly important, 
whereas FDR control provides more statistical power and may be preferable when it is 
sufficient merely to have an adequate preponderance of correct discoveries (e.g., when 
screening through a large number of associations to generate hypotheses for future study). In 
terms of stringency, FWER control occupies a middle ground between the other two rates: It 
considers avoiding even one Type I error important, but considers multiple simultaneous Type 
I errors to be no more worrisome than a single Type I error. 

(c) As recommended by the American Psychological Association (2012) and by other sources 
(including a previous article in this journal; Tromovitch, 2012), report precise p-values rather 
than merely reporting “p < .05,” so that readers requiring a different level of inference can 
apply an alternative approach. Note also that confidence intervals are generally more 
informative than p-values alone, given that the size of the effect—not merely whether the 
effect is different from zero—is presumably important in most cases. 

(d) Regardless of which approach to Type I error control is used, report the number of tests 
conducted (including those implicitly conducted when “fishing” through the data for 
significance), the structure of the testing (e.g., which comparisons were of primary and 
secondary interest a priori), and why the chosen approach to Type I error control was deemed 
appropriate for the study. Statistical power analysis is often valuable as well, especially when 
nonsignificant results are potentially interesting. When possible, all this information should be 
preregistered in a study protocol (or similar document) before the study begins—which 
typically should be no problem for analyses that truly are “planned.” 
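
As a rough illustration of recommendation (b), the sketch below applies one common procedure for each error rate to the same set of p-values: Bonferroni (which controls the PFER, and therefore also the FWER), Holm (which controls the FWER), and Benjamini-Hochberg (which controls the FDR under independence or positive dependence). The choice of these three procedures is an assumption made for illustration, not a prescription from this article. The p-values are hypothetical, and the function names are not from any library.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values (FWER control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is multiplied by (m - rank);
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        idx = order[rank]
        running_min = min(running_min, pvals[idx] * m / (rank + 1))
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from five tests (illustrative only).
pvals = [0.001, 0.012, 0.021, 0.03, 0.2]
alpha = 0.05
for name, method in [("Bonferroni", bonferroni),
                     ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    adj = method(pvals)
    n_rejected = sum(p <= alpha for p in adj)
    print(f"{name:20s} rejections at {alpha}: {n_rejected}   "
          f"adjusted: {[round(p, 4) for p in adj]}")
```

With these hypothetical inputs, the number of rejections grows as the controlled error rate becomes less stringent, which reflects the tradeoff between stringency and power described in (b).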

7.2. Recommendations for Professors and Textbook Authors 

(a) Refrain from perpetuating PHEMA, and explicitly refute PHEMA when presenting the 
concept of multiplicity or when distinguishing between planned and unplanned tests. 

(b) Be wary of the term post hoc, which has become ambiguous through misuse. In fact, Ryan 
(1995) recommended that the term not be used at all in the context of hypothesis testing. The 
word exploratory may also be problematic: The term generally means “not confirmatory,” but 
is often used as a synonym for “unplanned” when describing a data analysis—even though 
planned tests can be exploratory also, especially in early stages of research. 

(c) When discussing how statistical procedures should be applied, emphasize the fundamental 
goals of those procedures. For example, the purpose of null hypothesis testing is to limit the 
rate at which scientific harm is caused by false discoveries, and the purpose of multiplicity 
adjustments is to preserve that limit when there are multiple simultaneous opportunities for 
scientific harm. If these basic goals are understood, then it is easy to recognize that whether the 
tests were planned or not is irrelevant to those goals—a planned opportunity is an opportunity 
nonetheless. 